Adrià Garriga-Alonso

Shammie

Among Us: A Sandbox for Agentic Deception

Apr 05, 2025

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Apr 02, 2025

Hypothesis Testing the Circuit Hypothesis in LLMs

Oct 16, 2024

Planning behavior in a recurrent neural network that plays Sokoban

Jul 22, 2024

Adversarial Circuit Evaluation

Jul 21, 2024

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Jul 19, 2024

Investigating the Indirect Object Identification circuit in Mamba

Jul 19, 2024

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Jul 19, 2024

Towards Automated Circuit Discovery for Mechanistic Interpretability

Apr 28, 2023

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Jun 10, 2022